Improve MPT fp8 #1256
Conversation
python run_generation.py --model_name_or_path mosaicml/mpt-7b --use_hpu_graphs --use_kv_cache --limit_hpu_graphs --max_input_tokens 128 --max_new_tokens 128 --batch_size 256 --bf16
python run_generation.py --model_name_or_path mosaicml/mpt-7b --use_hpu_graphs --use_kv_cache --limit_hpu_graphs --max_input_tokens 128 --max_new_tokens 128 --batch_size 256 --bf16 --use_flash_attention
flash_attention_recompute: Optional[bool] = False,
):
    """
    Copied from MptAttention.forward: https://github.com/huggingface/transformers/blob/v4.32.0/src/transformers/models/mpt/modeling_mpt.py
At least part of the code looks like it was copied from a newer version than v4.32.0; could you verify and update this comment?
This line is original (line 123).
Still, please update it to the latest version, as we are copying code from the latest release.
merge MptAttention forward r4.44.1
Force-pushed from 30d8302 to f7704e4
Force-pushed from 5f901df to b3f729f
Add Softmax and FusedSDPA
Update GaudiMptAttention forward to r4.44.1 base
Co-authored-by: Thanaji Rao Thakkalapelli <[email protected]>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@atakaha, please use make style to fix the code format issue.
…etes example (huggingface#1286) Signed-off-by: dmsuehir <[email protected]>
@libinta, please have somebody review this PR. Thanks.
attn_weights = None
else:
    attention_scores = torch.matmul(query_states, key_states.transpose(-1, -2)) * self.softmax_scale
You should also use the fp8 matmul kernel. Please check the Falcon code below.
https://github.com/huggingface/optimum-habana/blob/main/optimum/habana/transformers/models/falcon/modeling_falcon.py#L108
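For reference, the linked Falcon code follows a pattern where the bare torch.matmul is wrapped in a small nn.Module so that INC can patch it with an fp8 matmul kernel. A minimal sketch of that pattern, with illustrative attribute names like matmul_qk/matmul_av that are assumptions rather than this PR's code:

import torch

class Matmul(torch.nn.Module):
    # Thin wrapper so the quantization toolkit (INC) can swap in an fp8 matmul kernel.
    def forward(self, x, y):
        return torch.matmul(x, y)

# Illustrative use inside an attention module (attribute names are assumptions):
#   self.matmul_qk = Matmul()   # created in __init__
#   self.matmul_av = Matmul()
#   ...
#   attention_scores = self.matmul_qk(query_states, key_states.transpose(-1, -2)) * self.softmax_scale
#   context_states = self.matmul_av(attention_probs, value_states)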
I tested it and throughput slowed down. The trace showed that when MPT uses it, it doesn't run in fp8, whereas the bare torch.matmul does use fp8. That's why MPT doesn't use the fp8 matmul kernel this time.
That's weird. I'm wondering if fp8_softmax is actually forcing the matmul to also run in fp8. Do you see good accuracy with fp8 softmax?
class Matmul(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x, y):
        return torch.matmul(x, y)
I compared FP8 profile data between GPT-J, Mistral, and MPT:
- GPT-J and Mistral spend <70% of the time in MME, but MPT spends ~30%.
- GPT-J and Mistral call the index_copy_fwd_hf8 and cast_bf16_to_hf8 kernels, but those kernel calls don't appear in MPT. When matmul_qk and matmul_qv are added to the blocklist of maxabs_quant.json for MPT, it calls the same kernels as GPT-J/Mistral.
This is the reason we didn't add this.
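For context, a hypothetical sketch of what blocklisting those ops could look like; the exact schema of maxabs_quant.json and the surrounding fields are assumptions, not taken from this PR:

# Hypothetical sketch: the config schema below is assumed, not copied from this PR.
import json

quant_config = {
    "method": "HOOKS",
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    # Assumed blocklist section: keep these wrapped matmuls out of fp8 quantization.
    "blocklist": {"names": ["matmul_qk", "matmul_qv"]},
    "dump_stats_path": "./hqt_output/measure",
}

# Hypothetical file name, for illustration only.
with open("maxabs_quant_mpt.json", "w") as f:
    json.dump(quant_config, f, indent=4)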
Also, can you make sure your change doesn't break anything for the training case, since the model file is used for both?
if use_flash_attention and FusedSDPA:
    import habana_frameworks.torch.hpu as ht

    with ht.sdp_kernel(enable_recompute=flash_attention_recompute):
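For reference, a minimal sketch of what the call under this context manager could look like; the FusedSDPA.apply argument order here is an assumption based on other optimum-habana ports, not this PR's exact code:

# Sketch only; assumes FusedSDPA was imported earlier via
#   from habana_frameworks.torch.hpex.kernels import FusedSDPA
# and that apply() follows the (q, k, v, mask, dropout_p, is_causal, scale) convention
# used by other optimum-habana ports.
with ht.sdp_kernel(enable_recompute=flash_attention_recompute):
    attn_output = FusedSDPA.apply(
        query_states,    # (batch, num_heads, q_len, head_dim)
        key_states,      # (batch, num_heads, kv_len, head_dim)
        value_states,    # (batch, num_heads, kv_len, head_dim)
        attention_mask,  # additive attention mask, or None
        0.0,             # dropout probability
        False,           # is_causal: the mask is passed explicitly here
        None,            # scale: None falls back to 1/sqrt(head_dim)
    )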
Can you also check other models to see whether this enable_recompute should be set based on fp8/bf16 and q_len?
If enable_recompute is set the same way as in other models, FP8 throughput drops to half. In the trace, softmax_stage1_fwd_f32 appears and spends a lot of time; that kernel is not present in the enable_recompute = False case.
what's your fp8 command with flash_attention?
Did you test any case that enables causal_mask and enable_recompute?
How about a longer prompt? Usually causal_mask shows better perf for long prompts.
what's your fp8 command with flash_attention?
The command line is:
QUANT_CONFIG=./quantization_config/maxabs_quant.json \
python run_generation.py \
--model_name_or_path mosaicml/mpt-7b \
--use_hpu_graphs \
--use_kv_cache \
--limit_hpu_graphs \
--max_input_tokens 128 \
--max_new_tokens 128 \
--batch_size 128 \
--bf16 \
--use_flash_attention
Did you test any case that enables causal_mask and enable_recompute? How about a longer prompt? Usually causal_mask shows better perf for long prompts.
No, I haven't tested these cases.
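For background, a small illustrative helper (not from this PR or any particular model file) showing how a causal-mask fast path is typically chosen for prompt versus decode steps:

# Illustrative only: the function name and policy are assumptions, not optimum-habana API.
def choose_sdpa_mask(q_len, attention_mask, use_causal_mask):
    # Return (mask, is_causal) for a fused SDPA call.
    if q_len > 1 and use_causal_mask:
        # Prompt phase: let the kernel apply the causal mask itself,
        # which is usually faster for long prompts.
        return None, True
    # Decode phase (q_len == 1) or explicit masking: pass the mask through.
    return attention_mask, False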
    super().__init__()

def forward(self, x, dim=None, invAttnHead=None):
    return torch.ops.hpu.softmax_fp8(x, dim, None, None, invAttnHead)
Since INC is enabled, please use torch.nn.functional.softmax, as it is a function supported by INC for quantization.
https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html#supported-functions
Suggested change:
- return torch.ops.hpu.softmax_fp8(x, dim, None, None, invAttnHead)
+ return torch.nn.functional.softmax(x, dim)
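Put together, the wrapper module with this suggestion applied would look roughly like the sketch below; the class name follows the commit message ("Add Softmax") and the rest is assumed unchanged:

import torch

class Softmax(torch.nn.Module):
    # Softmax wrapped in a module so INC can quantize it to fp8.
    def __init__(self):
        super().__init__()

    def forward(self, x, dim=None, invAttnHead=None):
        # invAttnHead is kept for signature compatibility; using the standard softmax
        # lets INC substitute the fp8 variant during quantization.
        return torch.nn.functional.softmax(x, dim)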
Updated the code with your suggestion.
Looks good.
Could you investigate further, though, to see whether we need causal_mask and recompute enabled differently for long prompts, and submit a separate patch if needed?
The code quality check failed, please run make style.
Fixed the ruff error.
Sure, I will.
@regisss, please review this PR.
What does this PR do?
- Add Softmax and FusedSDPA
- Fix unnecessary args in the self._gradient_checkpointing_func() call.